Go hunting in sequence databases but watch out for the traps



The large amount of data created by 
world-wide sequencing efforts calls for 
automation in data handling and analysis. 
This requires accurate storage and updat- 
ing mechanisms as well as appropriate 
retrieval software. User-friendly interfaces
are also needed, as the number of
researchers that access the information 
stored in public sequence databases is 
increasing considerably. The database
teams are aware of these demands and
the invaluable sequence databases are
improving, but the databases are also the
product of history and, like the accessing
software, far from perfect. Thus, at present,
working with sequence databases requires
knowledge of their powers and their pitfalls.
Here, we concentrate on some of the prob- 
lems that many users are unaware of, but 
that can have a considerable influence on 
the interpretation of the data. Some of the 
more frequent problems are summarized 
below, and some specific examples are 
given in Boxes 1 and 2. 

Problems within the sequence databases themselves

Sequencing errors seem to be in the
order of 0.1% (Ref. 1), excluding ESTs (single
reads with a very high error rate), and affect
about 5% of the proteins (Ref. 2). When screening
300 human proteins in SWISS-PROT that
have been published separately more than
once, we find that 0.3% of the amino acids
are different; this is a lower limit, as many
corrections have already been made and as
sequences appearing in two different publications
are often not independent. In any
case, only frameshift errors and artificial stop
codons can be detected unambiguously;
point mutations are hard to verify because natural
polymorphism or strain differences cannot
easily be excluded. Even if this rate seems
low, errors can accumulate in the sequence
of interest (see Ref. 3 for an example) and
can lead to functional misinterpretations.
Moreover, although the quality of sequenc- 
ing is improving, budget calculations might 
favor quantity instead of quality in the near 
future; the successful strategies based on 
ESTs demonstrate that data quality and its 
interpretation remain a major issue. Errors from
various sources are also a major problem
for other molecular databases such as the
Brookhaven Protein Data Bank (Ref. 4).
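The screening described above, in which independently published copies of the same protein are compared residue by residue, can be sketched roughly as follows. This is only an illustration under stated assumptions: the two copies of each protein are taken to be already aligned and of equal length, and the function name and toy peptides are hypothetical rather than taken from SWISS-PROT.

```python
def mismatch_fraction(pairs):
    """Fraction of compared positions that differ across sequence pairs.

    Each pair holds two pre-aligned, equal-length copies of one protein
    (an assumption of this sketch; real data would need an alignment step).
    """
    differing = 0
    compared = 0
    for first, second in pairs:
        if len(first) != len(second):
            raise ValueError("sequences in a pair must be pre-aligned")
        for a, b in zip(first, second):
            # Skip positions where either copy is unknown (X) or gapped (-).
            if a in "X-" or b in "X-":
                continue
            compared += 1
            if a != b:
                differing += 1
    return differing / compared if compared else 0.0


# Hypothetical toy data: two reports each of the same 10-residue peptide.
toy_pairs = [("MKTAYIAKQR", "MKTAYIAKQR"), ("MKTAYIAKQR", "MKTSYIAKQR")]
print(f"{mismatch_fraction(toy_pairs):.3f}")  # prints 0.050
```

A count of this kind cannot tell a sequencing error from a natural polymorphism or a strain difference, which is why the text treats the observed value only as a lower limit.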

The processing of raw DNA by database
management software is another serious
source of problems. For example, incorrect
translation of genomic DNA into gene
products, with missed exons or translated
introns, leads to erroneous entries in
protein sequence databases; the correct
initiating methionine is not always chosen as
the translation start; and ORFs translated from
the opposite strand of the gene end up as
proteins. The challenge is to improve prediction
methods, as the widely used algorithms
for gene identification in higher
eukaryotes have an accuracy of only
60-70% (Ref. 5): nearly a third of the automatically
predicted proteins from genomic
DNA without clear homologs are expected
to contain some serious errors.
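To see why the choice of strand and reading frame matters, it can help to translate one stretch of DNA in all six frames. The sketch below does only that, using the standard genetic code; it is not a gene-prediction method and is not the procedure used by any database. The function names and the nine-base test sequence are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG codon ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; trailing bases that do not fill a
    codon are dropped and codons with ambiguous bases become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return {frame label: protein} for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames["+1"], frames["-1"])  # prints MA* LGH
```

Every frame yields some string of amino acids, so a translation program has no way of knowing from the output alone that it has read the wrong strand or frame.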

Erroneous annotation is also common, 
ranging from simple spelling errors to 



Box 1. Some arbitrarily chosen examples that demonstrate various kinds of pitfalls in database usage 



Synonyms

In organisms that are the target of major genetic studies, it often
happens that the same gene is isolated by many different
groups and so it ends up with many different names. For example,
yeast TUP1 is also known as AER2, SFL2, CYC9, UMR7,
AAR1, AMM1 and FLK1. In Escherichia coli, hns is also known
as hnsA, drdX, osmZ, bglY, msyA, cur, pilG and topS. The multiplicity
of synonyms also exists at the level of protein names.
For example, annexin V was also called: lipocortin V,
endonexin II, calphobindin I, placental anticoagulant protein
I, pp4, thromboplastin inhibitor, vascular anticoagulant-alpha
and anchorin CII.
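One way a retrieval tool might cope with synonyms is to index every known alias under its canonical gene name. The following is a minimal sketch, assuming a hand-made synonym table; the data structure and function names are hypothetical, and only the yeast example from this box is included.

```python
# Synonyms taken from the yeast example in this box; a real resource
# would load them from a curated nomenclature table.
SYNONYMS = {
    ("yeast", "TUP1"): ["AER2", "SFL2", "CYC9", "UMR7", "AAR1", "AMM1", "FLK1"],
}


def build_index(synonyms):
    """Map (organism, any known name) to the set of canonical names."""
    index = {}
    for (organism, canonical), aliases in synonyms.items():
        for name in [canonical, *aliases]:
            index.setdefault((organism, name), set()).add(canonical)
    return index


def resolve(index, organism, name):
    """Return sorted canonical names; more than one signals an ambiguous name."""
    return sorted(index.get((organism, name), ()))


index = build_index(SYNONYMS)
print(resolve(index, "yeast", "CYC9"))  # prints ['TUP1']
```

The index is keyed by organism as well as by name, and matching is deliberately case-sensitive, because the same string can denote different genes in different organisms or in different letter cases.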

Different gene - same name

Conversely, it often happens that the same gene name is given 
to two different genes. Generally one of these duplicate names 
is quickly changed, but in some cases the two gene names each 
find a lobby and are simultaneously promoted. For example, 
yeast MRF1 is both the gene for the mitochondrial peptide 
chain release factor 1 and for the mitochondrial respiratory 
function protein 1. A famous example is 'cyclin', the accepted 
name for a large family of cell-cycle components, which be- 
came so prominent that this name is no longer used for a pro- 
tein now known as proliferating cell nuclear antigen (originally 
called cyclin). 

Spelling

Even spelling mistakes can end up as gene synonyms. For
example, the yeast gene SCD25 (suppressor of CDC25) was
so often misquoted as SDC25 that it has become an accepted
synonym. In addition to spelling mistakes, database queries can
be hindered by: differences between US and UK spelling (e.g.
hemoglobin or haemoglobin); representation of special characters,
such as accented characters (e.g. Krüppel, Krueppel or
Kruppel); upper and lower cases (e.g. in the Drosophila genetic
nomenclature, H is Hairless, but h is hairy).
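The accented-character problem can be softened by indexing each name under its common spelling variants while leaving letter case untouched. The sketch below is one possible approach, not the behaviour of any existing retrieval system; the transliteration table covers German umlauts only and, like the function names, is an assumption made for illustration.

```python
import unicodedata

# German-style transliterations of umlauts (an assumption of this sketch;
# other languages transliterate accented letters differently).
TRANSLITERATION = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}


def strip_accents(text):
    """Drop combining marks, e.g. u-umlaut becomes plain u."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def spelling_variants(name):
    """Return the name as written, without accents, and transliterated."""
    name = unicodedata.normalize("NFC", name)
    transliterated = "".join(TRANSLITERATION.get(ch, ch) for ch in name)
    return {name, strip_accents(name), transliterated}


def build_lookup(canonical_names):
    """Map every spelling variant to the canonical names it may stand for."""
    lookup = {}
    for canonical in canonical_names:
        for variant in spelling_variants(canonical):
            lookup.setdefault(variant, set()).add(canonical)
    return lookup


# Case is preserved on purpose: in Drosophila, H and h are different genes.
lookup = build_lookup(["Kr\u00fcppel", "H", "h"])
print(sorted(lookup))
```

Differences between US and UK spelling are not handled here; they would need a separate, curated list of word pairs.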

Biological source and contamination 

There are numerous problems with the annotation of the biological
source of a sequence. For example, the ORGanelle
division of EMBL/GenBank should only contain
sequences that are encoded in the mitochondrion or plastid. But
entries reporting nuclear-encoded genes for proteins targeted
to such an organelle are often wrongly entered in the ORG division.
The converse is also true, as some chloroplast- or
mitochondrion-encoded sequences are found in other
divisions of EMBL/GenBank. This problem can have an effect
on the derived protein sequence: if a nuclear-encoded mitochondrial
gene is misclassified into the ORG section, the resultant
translation will be wrong as the automatic translation software
will assume that a mitochondrial genetic code should be
used. The contamination of cDNA libraries (usually by fungal
or bacterial DNA) is still an issue (a prominent example is one
of Genethon's 'human' EST libraries that has a surprising number
of matches with the yeast genome). Some scientific surprises
can result from these issues: it was found recently
(Laurent Duret, pers. commun.) that the sequences of two genes
coding for annexin 1 and insulin from a sponge (their 'identification'
in lower eukaryotes was unexpected) were too closely
related to their mammalian homologs. It turned out that the biological
samples had most probably been contaminated by an
undetermined rodent species.
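The effect of applying the wrong genetic code can be made concrete by listing the codons of a coding sequence that the two codes read differently. The sketch below assumes the vertebrate mitochondrial code (NCBI translation table 2) as the example; the box does not say which mitochondrial code the translation software applies, and the function name and test sequence are hypothetical.

```python
# Codons read differently by the standard code and the vertebrate
# mitochondrial code (NCBI translation table 2): codon -> (standard, mito).
CODE_DIFFERENCES = {
    "AGA": ("R", "*"),
    "AGG": ("R", "*"),
    "ATA": ("I", "M"),
    "TGA": ("*", "W"),
}


def affected_codons(cds):
    """List (codon number, codon, standard, mito) where the two codes disagree.

    The sequence is read in frame from its first base; codon numbers
    start at 1.
    """
    cds = cds.upper()
    hits = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in CODE_DIFFERENCES:
            standard, mito = CODE_DIFFERENCES[codon]
            hits.append((i // 3 + 1, codon, standard, mito))
    return hits


# Hypothetical nuclear coding sequence: ATG AGA TGG ATA TAA
for hit in affected_codons("ATGAGATGGATATAA"):
    print(hit)
```

In this toy case the arginine codon AGA would be read as a stop under the mitochondrial code, so a misfiled nuclear gene could be truncated after its first residue.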
discrepancies between sequences published
in printed form. This should decrease with
electronic submissions and quality check 
software. Sequences can also be incorrectly 
labelled because of contamination of the 
sequenced material (Box 1). 

A major annotation problem is the
functional description of a gene. Owing to
the increasing number of 'software
robots' (Ref. 6) that automatically assign functions
based on similarities, a single
wrongly annotated entry will lead to
whole families with artificial functions
based on similarities to that entry [e.g. Ref.
7, in which information was carefully transferred
from a gene called nifr3 to several
unannotated proteins; whether nifr3 functions
in nitrogen fixation (nif), however,
remains unclear, despite its annotation].
Automatic methods often cannot adapt to 
the dynamic annotation of entries and 
might, for example, point to neighboring 
genes in automatically translated protein 
sequences after a new ORF has been dis- 
covered in an already annotated region. 
Further annotation problems arise due to 
user interpretation (see below). 
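How a single questionable annotation can spread through a family is easy to show with a toy model. The sketch below copies one functional label along similarity links and records how many transfers separate each entry from the original one; it is not a description of how any particular 'software robot' works, and the ORF names and similarity links are hypothetical.

```python
from collections import deque


def propagate_annotation(similar, seed, function):
    """Breadth-first transfer of one annotation along similarity links.

    Returns {entry: (function, transfer_depth)}; depth 0 is the seed itself,
    depth n means the label was copied n times without new evidence.
    """
    annotated = {seed: (function, 0)}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        depth = annotated[current][1]
        for neighbour in similar.get(current, ()):
            if neighbour not in annotated:
                annotated[neighbour] = (function, depth + 1)
                queue.append(neighbour)
    return annotated


# Hypothetical similarity links between database entries.
similar = {"nifr3": ["orfA", "orfB"], "orfA": ["orfC"], "orfB": [], "orfC": []}
result = propagate_annotation(similar, "nifr3", "nitrogen fixation (unverified)")
for entry, (label, depth) in result.items():
    print(entry, label, depth)
```

Keeping the transfer depth alongside the label is one simple way to let a user see that an annotation rests on a chain of similarities rather than on experimental evidence.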

The retrieval of data is often hindered 
by incorrect genetic nomenclature (Box 1). 
Even in relatively similar organisms, such 
as budding yeast and fission yeast, the 
same gene name actually points to non- 
homologous functionally different proteins 
(e.g. RAD4). Details of the syntax often
cannot be reflected in the databases. Major
attempts have been made to classify
enzymes, but even here the functional classification
needs to be complemented by
consistent homology and three-dimensional
information. Nomenclature is also
needed at the level of domains, structurally
independent parts of proteins for which
several names frequently coexist in the literature
and consequently in the databases (Ref. 8).
The different communities should enforce
standardization.

Finally, databases are not always up to 
date regarding the functional information 
or other annotated features because there 
is currently no systematic update mechanism.
Because of the policy of some databases
that only authors can change the content of
an entry, follow-up characterizations of
genes or gene products are only
occasionally included.

Problems of interpretation 

Numerous pitfalls are related to the inter- 
pretation of the results of the database- 
accessing software; simple problems arise 
if the retrieval system does not access the 
full dataset so that stored information is not 
found. Furthermore, the occasional user 
often only accesses the major sequence 
databases that contain information from all 
organisms. Many communities studying 
particular protein families or organisms 
know about specialized databases that
contain much more information on particular
genes or proteins, but which are
often not linked to the major databases.
There is hope that this will change in the 
near future (e.g. links to FlyBase, YPD in 
SWISS-PROT). 



A battery of traps results from database
similarity searches, probably the most
prominent form of database access. For
example, the user might have insufficient
knowledge of the limits of the programs
(e.g. 'homology' to the coiled-coil region
of myosin that is due to similar structural
constraints), and inadequate thresholds
and parameters often prevent an objective
analysis. A different problem results
from the pressure on sequencing groups
not to overlook interesting functional
information in 'their' sequences. Thus,
similarity search methods are stretched
and spurious hits are taken as real.
Moreover, similarities might be
restricted to certain domains, but the
function is transferred to the whole protein.
All such questionable interpretations end
up in databases and are then considered
as facts.
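The domain problem suggests a simple precaution before copying a functional description: check how much of the database hit the alignment actually covers. The sketch below illustrates the idea only; the function name and the 0.8 cut-off are hypothetical choices, not a published recommendation, and a high coverage does not by itself establish a shared function.

```python
def transfer_scope(hit_start, hit_end, hit_length, min_coverage=0.8):
    """Decide how far an annotation may be carried over from a database hit.

    hit_start, hit_end: 1-based inclusive alignment coordinates on the hit.
    min_coverage: hypothetical cut-off chosen for illustration.
    Returns (coverage, scope).
    """
    if not 1 <= hit_start <= hit_end <= hit_length:
        raise ValueError("alignment coordinates outside the hit sequence")
    coverage = (hit_end - hit_start + 1) / hit_length
    scope = "whole protein" if coverage >= min_coverage else "aligned region only"
    return coverage, scope


# A match to residues 101-250 of a 1000-residue hit covers 15% of the hit.
print(transfer_scope(101, 250, 1000))
```

When the scope is limited to the aligned region, only a statement about that region (for example, the presence of a particular domain) is supported, not a description of the whole protein.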

Finally, here is just one example that
demonstrates the difficulties of functional
predictions based on homology. Imagine
the best hit to your Drosophila sequence
is the human zinc-containing alcohol
dehydrogenase class 4 μ/σ (in databases
mu/sigma) chain (ADH7). It is very difficult
to find out whether the Drosophila
sequence is the ortholog, another alcohol
dehydrogenase, a homologous lactate
dehydrogenase, a more distantly related
oxidoreductase, or perhaps just a protein
with an NADH-binding site. There is currently
little quantification possible in
terms of functional similarity; a way out
might be the knowledge of the complete



Box 2. Unusual database entries

'Protein' sequences in databases can be as short as one amino
acid that is sometimes an X (as happens in the patent divisions
of the databases) so that several database accession software
packages have problems. Automatic DNA translation programs
that contribute a considerable fraction of the protein sequences
can also be misled: the following TREMBL (automatic
Translation of EMBL; version August 96) entry is a mistranslation
(compare with the annotated CDS), probably due to some annotation
problems in the corresponding EMBL entry. The name is
also unusual: a human protein with an identifier starting with MM
(usually meaning Mus musculus). It is supposed to encode a
small region of trk4 but the translation comes up with parts of a
different protein, MAC25. Detective work is needed to figure out
the errors that led to the wrong translation.



ID   MMTRK4A_1   standard; PRT; 118 AA.
AC   M55337;
DR   EMBL; M55337; MMTRK4A.
DE   gene: "trk4"; product: "oncogene tyrosine protein kinase receptor";
DE   Human oncogene tyrosine protein kinase receptor (trk4) mRNA,
DE   partial cds.
OS   Homo sapiens (human)
OC   Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
OC   Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.
FT   CDS             <1..>354
FT                   /gene="trk4"
FT                   /note="NCBI gi: 339916"
FT                   /codon_start=1
FT                   /product="oncogene tyrosine protein kinase receptor"
FT                   /db_xref="PID:g339916"
FT                   /translation="HSIKDVHARLQALAQEQEFXEQEQEEQEGEEAATPSGGGRNRSAS
FT                   SSWVGTMAGISMSLHFOTLGG^
FT                   LKWELGEGAFGKVF"
CC   translated using genetic code table "Standard"
CC   Warning: codon start shifted by 1
CC   Warning: illegal start codon
SQ   Sequence 118 BP;
Mmtrk4a_1  Length: 118  August 20, 1996 23:39  Type: P  Check: 7282
     1  PLPPHPAMER PSLRALLLGA AGLLLLLLPL SSSSSSDTCG PCEPASCPPL
    51  PPLGCLLGET RDACGCCPMC ARGEGEPCGG GGAGRGYCAP GMECVKSRKR
   101  RKGKAGAAAG GPGVSGVC



gene pool of organisms from all major
taxa, which will allow classification within
multigene families via phylogenetic trees.

Shared responsibilities 

Although the list of problematic issues
is much longer, we wish to point out that
sequence databases are the most useful
tool in sequence analysis. The question
should be: how can one further improve
their value by enhancing data storage,
handling and retrieval? How should the
responsibility for this task be shared?
Everybody who stores information should
feel responsible for the data and the annotation
quality. Database teams have a
restricted budget and can only provide
some quality checks (e.g. for cloning and
sequencing errors, artificially translated
vectors, repeats and so on). Databases rely
on standards and these also have to come
from the different communities in the form
of agreed nomenclature and clearly reproducible
functional characterizations.
Specialists should spend the time to give 
feedback on encountered problems and 
database teams should have mechanisms 
to include such improvements. This is, of 
course, easily said, but opinions about data 
and annotation vary and the truth is not 
always obvious. In conclusion, a concerted 
effort is needed from the database teams 
that have to maximize their services and 
the user community that should share 
responsibilities in taking care of the qual- 
ity of the entries. 
Meeting Reports

Flies in Crete 

10th EMBO Workshop on the Molecular and Developmental Biology of Drosophila, Kolymbari, Crete, 14-20 July 1996.



The results presented at this meeting were 
enormously varied and informative and the 
comments below represent just a small 
sample of the interesting science that was 
presented. 

The importance of polarity within a 
single cell was illustrated several times. 
Transcripts of the segment polarity gene 
wingless (wg) were found to be apically 
localized in polarized epithelial cells (H. 
Krause, Toronto) and, interestingly, local- 
ization was required for wg function. In 
contrast, localized wg transcripts were not 
required for wg function in nonpolar mes- 
enchymal cells. Also, with regard to cellu- 
lar polarity, the inscuteable gene was 
shown to be required for the orientation of 
the mitotic spindle and, therefore, for the 
correct plane of cell division (W. Chia, 
Singapore). Remarkably, in neuroblasts 
INSCUTEABLE is apically localized and is 
required for the basal localization of the 
homeodomain protein, PROSPERO. Thus, 
INSCUTEABLE appears to be an important 
component of the positional information 
within a cell. 

Five major signaling pathways were
discussed: hedgehog (hh), wingless (wg),
decapentaplegic (dpp), EGF and FGF.
Perhaps not surprisingly, several intersections
and similarities between signaling
pathways were apparent. For example,
SMOOTHENED protein, which appears
to be a G-protein-coupled seven-transmembrane
receptor, was suggested to be an HH receptor
(M. Noll, Zurich). Curiously,
SMOOTHENED shares striking similarity 
to the FRIZZLED family of proteins, which 
are putative receptors for Wnt (e.g. WG) 
signals. Downstream of the WG signal 
might be HMG-domain transcription fac- 
tors related to mouse lymphoid enhancer- 
binding factor 1 (LEF1) (M. Bienz, 
Cambridge, UK, in collaboration with R. 
Grosschedl). LEF1 binds to ARMADILLO, 
another downstream component of the 
WG signal and, when expressed in 
Drosophila, phenocopies wg-overexpres- 
sion phenotypes. Therefore, the tantalizing 
hypothesis that LEF1 might be an 
ARMADILLO-activated nuclear target for 
WG signaling was suggested. Another molecule
downstream of the WG signal is
encoded by arrow. Surprisingly, arrow
turns out to be identical to the gene centrosomin,
whose product is a component of the centrosome
(T. Kaufman, Bloomington and S.
DiNardo, New York). How the product(s)
of a single gene can participate in two
seemingly very different cellular processes
provides a curious puzzle. In the embryonic
endoderm, the homeodomain protein
encoded by extradenticle, which binds to
DNA cooperatively with HOX proteins, was
shown to translocate from cytoplasm to
nucleus in response to both DPP and WG
signals (R. Mann, New York). Similarly, the
novel protein encoded by Mothers against
dpp was also shown to translocate from the
cytoplasm into nuclei in response to DPP
(W. Gelbart, Cambridge, USA). Thus,
controlling protein localization within a
cell might be a common response to extracellular
signals.

In addition to intersecting signaling 
pathways, combinations of different signals 
were shown to be important for the acti- 
vation of even-skipped expression in a small 
cluster of mesoderm cells (A. Michelson, 
Cambridge, USA). The selection of these 
mesoderm cells, which are founders for a 
subset of muscles, depends on intersecting
fields of wg- and dpp-expressing cells in the
ectoderm, together with a RAS-dependent 
pathway (perhaps the EGF pathway). 
These signals define an equivalence group, 
which, as is the case for neuroblast selec- 
tion, is refined by the action of another set 
of signaling molecules encoded by the 
neurogenic genes, Notch and Delta. 